Distributional Evidence and Beyond: the Success and Limitations of Machine Learning in Chinese Word Segmentation

Authors: Jianqiang Ma, Dale Gerdemann

Research in Computing Science, Vol. 70, pp. 17-30, 2013.

Abstract: In this paper, we argue that the success of current state-of-the-art statistical learning algorithms for Chinese word segmentation (CWS) lies mostly in their optimal weighting of non-overlapping distributional evidence in the corpora. The utilization of distributional evidence is more essential than the learning algorithm. We further analyze the characteristics of distributional evidence for CWS under the framework of Zipf’s law, and summarize the limitation of statistical learning in CWS as the feature absence problem, which may be apparent yet is usually neglected. Making a connection between theoretical/empirical linguistics and CWS, we suggest that the study and development of a generative word formation system may benefit both the science and the engineering of CWS. We wrap up the discussion after reviewing some recent work along these lines.

PDF: Distributional Evidence and Beyond: the Success and Limitations of Machine Learning in Chinese Word Segmentation